Probabilities and Statistics

MSDA - Bootcamp 2025 Summer

KT Wong

Faculty of Social Sciences, HKU

2025-07-30

The materials in this topic are drawn from Stachurski (2016)

Motivation

  • Social processes are not deterministic

    • the “effects” of social causes are difficult to isolate and estimate
  • we need a framework for communicating our uncertainty about the inferences we draw in our empirical work

  • probability theory

    • helps us draw inferences from our models
    • the root of social statistics
  • it can even be helpful for case selection and small-n inference in qualitative research

Basic Concepts

some elements of set theory

  • set: a collection of elements

  • subset: set that is composed entirely of elements of another set

    • e.g. set A is a subset of set B if every element of A is also an element of B
      • i.e. set B contains set A
  • union: the union of two sets contains all the elements that belong to either set

  • intersection: the intersection of two sets contains only those elements found in both sets

  • complement: the complement of a given set is the set that contains all elements not in the original set

  • disjoint: Two sets are disjoint when their intersection is empty

[Venn diagrams: intersection, union, complement]

Basic Concepts

probability space

  • Let \(\Omega\) be a set of possible underlying outcomes - sample space

  • Let \(\omega \in \Omega\) be a particular underlying outcome

  • Let \(\mathcal{G} \subset \Omega\) be a subset of \(\Omega\) - event

  • \(\mathcal{F}\) be a collection of such subsets \(\mathcal{G} \subset \Omega\)

  • the pair \((\Omega,\mathcal{F})\) forms a probability space

Exercises

Consider the universal set \(\Omega = \{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}\)

Given the sets \(A = \{2, 4, 6, 8, 10\}\) and \(B = \{3, 6, 9\}\)

  • Find the complement of set A

  • Find the complement of set B

  • Calculate the intersection of sets A and B

  • Compute the union of the complements of sets A and B

  • Find the intersection of the complement of set A and B
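These operations map directly onto Python's built-in set type; a minimal sketch using the sets from the exercise, which also serves as a way to check your answers:

```python
# Universal set and the two sets from the exercise above
Omega = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
A = {2, 4, 6, 8, 10}
B = {3, 6, 9}

A_comp = Omega - A            # complement of A within Omega
B_comp = Omega - B            # complement of B
A_and_B = A & B               # intersection of A and B
comp_union = A_comp | B_comp  # union of the two complements
A_comp_and_B = A_comp & B     # intersection of A's complement with B

print(A_comp)   # {1, 3, 5, 7, 9}
print(A_and_B)  # {6}
```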

Basic Concepts

probability measure

  • A probability measure \(\textrm{P}\) maps each event \(A \in \mathcal{F}\) to a number between \(0\) and \(1\)

    • this is the “probability” that event \(A\) happens, denoted by \(\textrm{P}(A)\)
  • The probability measure must satisfy the following properties:

    1. \(\textrm{P}(\emptyset) = 0\)
    2. \(\textrm{P}(\Omega) = 1\)
    3. If \(A_1, A_2, \ldots\) are disjoint, then \(\textrm{P}(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \textrm{P}(A_i)\)

Basic Concepts

Law of Total Probability

  • The probability of event \(A\) can be decomposed into \(n\) parts along a partition \(B_1, \ldots, B_n\) of the sample space
    • one part that intersects with \(B_1\), another part that intersects with \(B_2\), and so on

  • we can state the probability formally

\[ \textrm{P}(A) = \textrm{P}(A \cap B_1) + \textrm{P}(A \cap B_2) + \ldots + \textrm{P}(A \cap B_n) \]

Basic Concepts

Conditional Probability

  • Conditional probability statements recognize that some prior information bears on the determination of subsequent probabilities

  • The conditional probability of event \(A\) given event \(B\) is denoted by \(\textrm{P}(A|B)\) and is defined as

\[ \textrm{P}(A|B) = \frac{\textrm{P}(A \cap B)}{\textrm{P}(B)} \]

  • we can use conditional probability statements to derive the law of total probability

\[ \textrm{P}(A) = \textrm{P}(A|B_1)\textrm{P}(B_1) + \textrm{P}(A|B_2)\textrm{P}(B_2) + \ldots + \textrm{P}(A|B_n)\textrm{P}(B_n) \]

  • where \(B_1, B_2, \ldots, B_n\) form a partition of the sample space
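The decomposition above is easy to verify numerically. A minimal sketch with made-up numbers (the partition probabilities and conditionals below are assumptions for illustration):

```python
# Hypothetical partition B1, B2, B3 of the sample space, with the
# conditional probabilities P(A | Bi) assumed known
P_B = [0.5, 0.3, 0.2]          # P(B1), P(B2), P(B3) -- must sum to 1
P_A_given_B = [0.1, 0.4, 0.8]  # P(A | B1), P(A | B2), P(A | B3)

# Law of total probability: P(A) = sum_i P(A | Bi) * P(Bi)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
print(round(P_A, 4))  # 0.33
```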

Basic Concepts

Bayes’ rule

  • Bayes’ rule can be regarded as a way to reverse conditional probabilities

  • let \(A\) and \(B\) be two events with \(\textrm{P}(B) > 0\)

    • then Bayes’ rule states that

\[ \textrm{P}(A|B) = \frac{\textrm{P}(B|A)\textrm{P}(A)}{\textrm{P}(B|A)\textrm{P}(A) + \textrm{P}(B|A^C)\textrm{P}(A^C)}\]

Exercises

Suppose you work in a building that has a fire alarm system. The fire alarm is designed to go off when there is a fire, and it’s also known that sometimes the alarm can go off due to smoke from a malfunctioning HVAC system.

  • there is a 1% chance that there is a fire: \(P(Fire) = 0.01\)
  • the alarm system works pretty well and there is a 95% chance it goes off when there is an actual fire
    • \(P(\text{Alarm goes off | Fire}) = 0.95\)
  • there is a 10% chance that the alarm goes off due to smoke without a fire
    • \(P(\text{Alarm goes off | No Fire}) = 0.1\)
  • what is the probability that there is actually a dangerous fire, given that the alarm goes off?
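One way to check your answer is to plug the three given numbers straight into Bayes' rule in Python (a sketch, not the only route):

```python
# Given quantities from the exercise
p_fire = 0.01                  # P(Fire)
p_alarm_given_fire = 0.95      # P(Alarm | Fire)
p_alarm_given_no_fire = 0.10   # P(Alarm | No Fire)

# Denominator: P(Alarm), expanded via the law of total probability
p_alarm = (p_alarm_given_fire * p_fire
           + p_alarm_given_no_fire * (1 - p_fire))

# Bayes' rule: P(Fire | Alarm)
p_fire_given_alarm = p_alarm_given_fire * p_fire / p_alarm
print(round(p_fire_given_alarm, 4))  # 0.0876 -- most alarms are false alarms
```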

Basic Concepts

Independence

Intuition: Information about the outcome of event A doesn’t change the probability of event B happening

  • Two events \(A\) and \(B\) are independent if

\[ P(A \cap B)=P(A)P(B) \]

  • we can deduce that if \(A\) and \(B\) are independent, then

    • \(P(A) = P(A|B)\)
    • \(P(B) = P(B|A)\)
  • When there are more than two events, we say that they are mutually independent if every sub-collection of the events satisfies the product rule

  • Reminder:

    • pairwise independence does not imply mutual independence
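The reminder can be made concrete with the classic two-coin example: let A and B be "first/second flip is heads" and C be "the flips agree". Enumerating the four equally likely outcomes in Python shows every pair is independent while the triple is not (a sketch, assuming fair coins):

```python
from itertools import product

# Sample space: two fair coin flips (1 = heads), each outcome has probability 1/4
omega = list(product([0, 1], repeat=2))

A = {w for w in omega if w[0] == 1}     # first flip is heads
B = {w for w in omega if w[1] == 1}     # second flip is heads
C = {w for w in omega if w[0] == w[1]}  # the two flips agree

def prob(event):
    return len(event) / len(omega)

# Every pair satisfies P(X and Y) = P(X) P(Y) ...
assert prob(A & B) == prob(A) * prob(B)
assert prob(A & C) == prob(A) * prob(C)
assert prob(B & C) == prob(B) * prob(C)

# ... but the three together do not: 1/4 != 1/8
assert prob(A & B & C) != prob(A) * prob(B) * prob(C)
```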

Basic Concepts

random variable

  • A random variable \(X(\omega)\) is a function of the underlying outcome \(\omega \in \Omega\)
    • \(X(\omega)\) has a probability distribution that is induced by the underlying probability measure \(P\) and the function \(X(\omega)\):

\[ \textrm{Prob} (X \in A ) = \int_{\mathcal{G}} P(\omega) d \omega \]

\(\qquad\) where \({\mathcal G}\) is the subset of \(\Omega\) for which \(X(\omega) \in A\)

Probability Distributions

A probability distribution \(\textrm{Prob} (X \in A)\) can be described by its cumulative distribution function (CDF)

\[ F_{X}(x) = \textrm{Prob}\{X\leq x\}. \]

A continuous-valued random variable can be described by a density function \(f(x)\) that is related to its CDF by

\[ \textrm{Prob} \{X\in B\} = \int_{t\in B}f(t)dt \]

\[ F(x) = \int_{-\infty}^{x}f(t)dt \]
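The relation between the density and the CDF can be checked numerically. A sketch using the exponential density \(f(t) = \lambda e^{-\lambda t}\) on \(t \geq 0\), whose CDF has the closed form \(1 - e^{-\lambda x}\):

```python
import numpy as np

# Approximate F(x) = integral of f(t) dt from 0 to x by the midpoint rule
lam, x = 1.5, 2.0
n = 100_000
dt = x / n
t = (np.arange(n) + 0.5) * dt     # midpoints of the n subintervals
F_numeric = np.sum(lam * np.exp(-lam * t)) * dt

F_exact = 1 - np.exp(-lam * x)    # closed-form exponential CDF
print(F_numeric, F_exact)         # both approximately 0.9502
```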

Probability Distributions

  • For a discrete-valued random variable

  • the number of possible values of \(X\) is finite or countably infinite

  • we replace a density with a probability mass function (pmf), a non-negative sequence that sums to one

  • we replace integration with summation in the formula that relates a CDF to a probability mass function (pmf)

  • let us discuss some common distributions for illustration

Common distributions

Discrete distributions

A discrete distribution is defined by a set of numbers \(S = \{x_1, \ldots, x_n\}\) and a probability mass function (pmf) on \(S\), which is a function \(p\) from \(S\) to \([0,1]\) with the property

\[ \sum_{i=1}^n p(x_i) = 1 \]

  • a random variable \(X\) has distribution \(p\) if \(X\) takes value \(x_i\) with probability \(p(x_i)\)

\[ \mathbb P\{X = x_i\} = p(x_i) \quad \text{for } i= 1, \ldots, n \]

Common distributions

Discrete distributions

  • The mean or expected value of a random variable \(X\) with distribution \(p\) is

\[ \mathbb{E}[X] = \sum_{i=1}^n x_i p(x_i) \]

  • Expectation is often called the first moment of the distribution in statistics
    • we call it the mean of the distribution (represented by) \(\mu\)
  • The variance of \(X\) is defined as

\[ \mathbb{V}[X] = \sum_{i=1}^n (x_i - \mathbb{E}[X])^2 p(x_i) \]

  • Variance is often called the second central moment of the distribution in statistics

Common distributions

Discrete distributions

  • The cumulative distribution function (CDF) of \(X\) is defined by

\[ F(x) = \mathbb{P}\{X \leq x\} = \sum_{i=1}^n \mathbb 1\{x_i \leq x\} p(x_i) \]

  • Here \(\mathbb 1\{ \textrm{statement} \} = 1\) if “statement” is true and zero otherwise

  • Hence the sum runs over all \(x_i \leq x\) and adds their probabilities
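The mean, variance, and CDF formulas above translate directly into a few lines of NumPy. The support and pmf below are made-up values for illustration:

```python
import numpy as np

# A discrete distribution: support S and a pmf p on S (assumed values)
x = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])
assert np.isclose(p.sum(), 1.0)    # a pmf must sum to one

mean = np.sum(x * p)               # E[X] = sum_i x_i p(x_i)
var = np.sum((x - mean) ** 2 * p)  # V[X] = sum_i (x_i - E[X])^2 p(x_i)

def cdf(t):
    """F(t): sum of p(x_i) over all x_i <= t (the indicator-sum formula)."""
    return np.sum(p[x <= t])

print(mean, var)  # 3.0 1.0 (up to floating point)
print(cdf(2))     # 0.3 (approximately)
```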

Common distributions

the uniform distribution

  • the uniform distribution on \(S = \{1, \ldots, n\}\) has \(p(x_i) = 1/n\) for all \(i\)
    • the mean is \((n+1)/2\)
    • the variance is \((n^2 - 1)/12\)

Bernoulli distribution

  • the Bernoulli distribution on \(S = \{0,1\}\), which has pmf:

\[ p(i) = \theta^i (1 - \theta)^{1-i} \qquad (i = 0, 1) \]

  • \(\theta \in [0,1]\) is a parameter
    • \(p(1) = \theta\) means that the trial succeeds (takes value 1) with probability \(\theta\)
    • the mean is \(\theta\); the variance is \(\theta(1-\theta)\)

Common distributions

Binomial distribution

the binomial distribution on \(S=\{0, \ldots, n\}\), which has pmf

\[ p(x) = \binom{n}{x} \theta^x (1-\theta)^{n-x} \]

  • \(\theta \in [0,1]\) is a parameter
    • the mean is \(n \theta\)
    • the variance is \(n \theta (1-\theta)\)

Common distributions

Poisson distribution

The Poisson distribution on \(S = \{0, 1, \ldots\}\) with parameter \(\lambda > 0\) has pmf

\[ p(x) = \frac{\lambda^x}{x!} e^{-\lambda} \]

  • The interpretation of \(p(x)\) is: the probability of \(x\) events in a fixed time interval, where the events occur independently at a constant rate \(\lambda\).
    • the mean is \(\lambda\) and the variance is \(\lambda\)
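The binomial and Poisson moment formulas can be verified by brute-force summation over the pmf, using only the Python standard library (the parameter values below are arbitrary choices):

```python
from math import comb, exp, factorial

# Binomial(n, theta): mean should be n*theta, variance n*theta*(1-theta)
n, theta = 10, 0.3
support = range(n + 1)
pmf = [comb(n, k) * theta**k * (1 - theta)**(n - k) for k in support]
mean = sum(k * p for k, p in zip(support, pmf))
var = sum((k - mean) ** 2 * p for k, p in zip(support, pmf))
print(round(mean, 6), round(var, 6))  # 3.0 2.1

# Poisson(lam): mean should equal lam (truncate the infinite sum far out)
lam = 4.0
pois = [lam**k / factorial(k) * exp(-lam) for k in range(100)]
pois_mean = sum(k * p for k, p in enumerate(pois))
print(round(pois_mean, 6))  # 4.0
```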

Common distributions

Normal distribution

the most famous distribution is the normal distribution, which has density

\[ p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

  • it has two parameters, \(\mu \in \mathbb R\) and \(\sigma \in (0, \infty)\)
    • the mean is \(\mu\) and the variance is \(\sigma^2\)

Common distributions

Continuous distributions

A continuous distribution is represented by a probability density function (pdf), which is a function \(p\) over \(\mathbb R\) such that \(p(x) \geq 0\) for all \(x\) and

\[ \int_{-\infty}^\infty p(x) dx = 1 \]

We say that random variable \(X\) has distribution \(p\) if

\[ \mathbb P\{a < X < b\} = \int_a^b p(x) dx \]

for all \(a \leq b\)

Common distributions

Lognormal distribution

The lognormal distribution is a distribution on \(\left(0, \infty\right)\) with density

\[ p(x) = \frac{1}{\sigma x \sqrt{2\pi}} \exp \left(- \frac{\left(\log x - \mu\right)^2}{2 \sigma^2} \right) \]

  • It has two parameters, \(\mu\) and \(\sigma\)
    • the mean is \(\exp\left(\mu + \sigma^2/2\right)\)
    • the variance is \(\left[\exp\left(\sigma^2\right) - 1\right] \exp\left(2\mu + \sigma^2\right)\)
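The mean formula can be sanity-checked by simulation: exponentiate normal draws and compare the sample mean with \(\exp(\mu + \sigma^2/2)\). The parameter values and seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.4

# If Z ~ N(mu, sigma^2), then X = exp(Z) is lognormal(mu, sigma)
x = np.exp(rng.normal(mu, sigma, size=1_000_000))

print(x.mean())                   # sample mean of the draws
print(np.exp(mu + sigma**2 / 2))  # theoretical mean, approximately 1.786
```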

Common distributions

Gamma distribution

The gamma distribution is a distribution on \(\left(0, \infty\right)\) with density

\[ p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\beta x) \]

  • It has two parameters, \(\alpha > 0\) and \(\beta > 0\)
    • the mean is \(\alpha / \beta\)
    • the variance is \(\alpha / \beta^2\)

Common distributions

Beta distribution

The beta distribution is a distribution on \((0, 1)\) with density

\[ p(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} \]

where \(\Gamma\) is the gamma function

  • it has two parameters, \(\alpha > 0\) and \(\beta > 0\)
    • the mean is \(\alpha / (\alpha + \beta)\)
    • the variance is \(\alpha \beta / \left[(\alpha + \beta)^2 (\alpha + \beta + 1)\right]\)
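The normalizing constant and the mean of the beta distribution can be checked by numerical integration over \((0, 1)\) (midpoint rule; the parameter values are arbitrary):

```python
import numpy as np
from math import gamma as G  # the gamma function

a, b = 2.0, 5.0
n = 200_000
dx = 1.0 / n
x = (np.arange(n) + 0.5) * dx     # midpoints of (0, 1)
pdf = G(a + b) / (G(a) * G(b)) * x**(a - 1) * (1 - x)**(b - 1)

total = np.sum(pdf) * dx          # the density should integrate to 1
mean_beta = np.sum(x * pdf) * dx  # should equal a / (a + b) = 2/7
print(round(total, 6), round(mean_beta, 6))  # 1.0 0.285714
```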

Bivariate Probability Distribution

Two discrete random variables

  • Let \(X,Y\) be two discrete random variables that take values:

\[ X\in\{0,\ldots,I-1\} \qquad Y\in\{0,\ldots,J-1\} \]

  • their joint distribution is described by a matrix

\[ F_{I\times J}=[f_{ij}]_{i\in\{0,\ldots,I-1\}, j\in\{0,\ldots,J-1\}} \]

  • whose elements are

\[ f_{ij}=\textrm{P}\{X=i,Y=j\} \geq 0 \]

  • where \(\sum_{i}\sum_{j}f_{ij}=1\)

Bivariate Probability Distribution

Two discrete random variables

The joint distribution induces marginal distributions

\[ \textrm{P}\{X=i\}= \sum_{j=0}^{J-1}f_{ij} = \mu_i, \quad i=0,\ldots,I-1 \]

\[ \textrm{P}\{Y=j\}= \sum_{i=0}^{I-1}f_{ij} = \nu_j, \quad j=0,\ldots,J-1 \]
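In NumPy the marginals are just row and column sums of the joint matrix. A sketch with a made-up \(2 \times 3\) joint distribution:

```python
import numpy as np

# Hypothetical joint pmf over (X, Y) with I = 2 rows and J = 3 columns
F = np.array([[0.10, 0.20, 0.05],
              [0.25, 0.15, 0.25]])
assert np.isclose(F.sum(), 1.0)  # joint probabilities sum to one

mu = F.sum(axis=1)  # marginal of X: sum over j within each row i
nu = F.sum(axis=0)  # marginal of Y: sum over i within each column j
print(mu)  # approximately [0.35 0.65]
print(nu)  # approximately [0.35 0.35 0.30]
```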

Bivariate Probability Distribution

Two discrete random variables

  • for example, let a joint distribution over \((X,Y)\) be

\[ F_{X,Y} = \left[ \begin{matrix} P(X=0, Y=0) & P(X=0, Y=1)\\ P(X=1, Y=0) & P(X=1, Y=1) \end{matrix} \right] = \left[ \begin{matrix} .25 & .1\\ .15 & .5 \end{matrix} \right] \]

  • the marginal distributions are:
\[\begin{aligned} \textrm{P} \{X=0\}&=\\ \textrm{P}\{X=1\}&=\\ \textrm{P}\{Y=0\}&=\\ \textrm{P}\{Y=1\}&= \end{aligned}\]

Summary Statistics

Suppose we have an observed distribution with values \(\{x_1, \ldots, x_n\}\)

  • The sample mean of this distribution is defined as

\[ \bar x = \frac{1}{n} \sum_{i=1}^n x_i \]

  • The sample variance (here the uncorrected, divide-by-\(n\) version) is defined as

\[ \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 \]
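Both summary statistics are one-liners in NumPy; the data below are made up for illustration:

```python
import numpy as np

# Sample mean and (uncorrected) sample variance of an observed distribution
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data
x_bar = x.mean()                # (1/n) sum_i x_i
s2 = np.mean((x - x_bar) ** 2)  # (1/n) sum_i (x_i - x_bar)^2
print(x_bar, s2)  # 5.0 4.0
```

Note that `np.var` uses this divide-by-n definition by default (`ddof=0`).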

LLN and CLT

  • two of the most important results in probability and statistics

    1. the law of large numbers (LLN)
    2. the central limit theorem (CLT)
  • Let \(X_1, \ldots, X_n\) be independent and identically distributed scalar random variables, with common distribution \(F\), common mean \(\mu\), and variance \(\sigma^2\)

  • Law of large numbers

\[ P(|\bar{X}_n- \mu | \geq \varepsilon) \rightarrow 0 \text{ as } n \rightarrow \infty, \quad \forall \varepsilon > 0 \]

  • Central limit theorem

\[ \sqrt{n}(\bar{X}_n- \mu) \stackrel{d}{\to} N(0, \sigma^2)\quad \text{ as } \quad n \to \infty \]
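Both results are easy to see by simulation. A sketch with exponential draws (so \(\mu = 2\) and \(\sigma = 2\); the seed and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 2.0, 2.0  # an Exponential(scale=2) has mean 2 and std dev 2

# LLN: the sample mean of iid draws settles down near the true mean
draws = rng.exponential(scale=mu, size=1_000_000)
print(abs(draws.mean() - mu))  # small

# CLT: sqrt(n) * (X_bar - mu) is approximately N(0, sigma^2) for large n;
# simulate many sample means and check the spread of the scaled deviations
n, reps = 400, 10_000
sample_means = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (sample_means - mu)
print(z.std())  # close to sigma = 2
```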

References

Stachurski, John. 2016. A Primer in Econometric Theory. Cambridge, Massachusetts: The MIT Press.